PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
https://arxiv.org/abs/2306.05087
The paper reports that PandaLM-7B achieves 93.75% of GPT-3.5's evaluation ability and 88.28% of GPT-4's, measured by F1-score on the authors' human-annotated test set (a sketch of how such a ratio is computed follows below).
https://github.com/WeOpenML/PandaLM
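The relative "evaluation ability" figures above are ratios of F1-scores against human gold labels. A minimal sketch of how such a ratio could be reproduced from three-way (win/lose/tie) pairwise judgments, assuming macro-averaged F1; the label arrays here are hypothetical placeholders, not data from the paper:

```python
from sklearn.metrics import f1_score

# Three-way judgment labels: 0 = response 1 wins, 1 = response 2 wins, 2 = tie.
human   = [0, 1, 2, 0, 1, 2, 0, 0]  # hypothetical human gold annotations
pandalm = [0, 1, 2, 0, 1, 0, 0, 1]  # hypothetical PandaLM-7B judgments
gpt35   = [0, 1, 2, 0, 1, 2, 0, 1]  # hypothetical GPT-3.5 judgments

# Macro-averaged F1 of each judge against the human labels (an assumption;
# the paper may average differently).
f1_pandalm = f1_score(human, pandalm, average="macro")
f1_gpt35   = f1_score(human, gpt35, average="macro")

# "X% of GPT-3.5's evaluation ability" as a ratio of F1-scores.
print(f"PandaLM reaches {100 * f1_pandalm / f1_gpt35:.2f}% of GPT-3.5's F1")
```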